The World Happiness Report 2018 by United Nations Sustainable Development Solutions Network (SDSN) based on data from the Gallup World Poll surveys was out a few days ago. I had come across such a report for the first time and it got me curious right away. How do you quantify something like happiness?
Basically, people in countries across the world are surveyed based on the following question:
The response to the question of respondents’ assessment of their current life based on an imaginary 11-point scale whereby 0 designates one’s worst possible life and 10 denotes the best possible life respondents can imagine for themselves. Based on the question “Please imagine a ladder with steps numbered from zero at the bottom to ten at the top. Suppose we say that the top of the ladder represents the best possible life for you, and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time, assuming that the higher the step the better you feel about your life, and the lower the step the worse you feel about it? Which step comes closest to the way you feel?”
It’s still not completely objective. How do you know how happy you are? Are you really as happy as you think you are? I mean what is happiness anyway? Without getting too philosophical, the question is a great way to measure happiness on a macro scale. There are however some questions like how representative the sample is but let’s assume that the sample indeed is representative. Luckily, they had made the summarized data publicly available.
Let’s get exploring!
First I load all the packages I need and the data.
pacman::p_load(tidyverse,
readxl,
GGally,
plotly,
rebus,
janitor,
corrplot,
mice,
caret,
naniar,
leaflet,
visdat)
df_happiness_raw <-
read_excel("data/WHR2018Chapter2OnlineData.xls") %>%
clean_names()
df_subregions <-
read_excel("data/WHR2018Chapter2OnlineData.xls", sheet = 5) %>%
clean_names() %>%
mutate_if(is.character, str_to_lower)
countries_continents <-
read_csv("data/Countries-Continents.csv") %>%
clean_names() %>%
mutate_all(str_to_lower)
Are there any missing values?
vis_miss(df_happiness_raw)
The data has a few missing values. I will impute those values using the mice package
#missing too much data as is visible
df_happiness <-
df_happiness_raw %>% select(-gini_index_world_bank_estimate)
dfs_imputed <- mice(df_happiness, m=1, maxit=20, meth='pmm', seed=1)
df_happiness <- complete(dfs_imputed, 1) %>% as_tibble()
I want to cross-reference continent. However, in the country-continent file I had downloaded, there were a few countries missing/with different names. I create a new variable called country recoded that fixes this with the sole purpose of getting the continent. The countries recoded will not be used for anything else and hence, I didn’t worry too much about its validity.
df_happiness <-
df_happiness %>%
rename(healthy_life_expectancy = healthy_life_expectancy_at_birth,
std_ladder = standard_deviation_of_ladder_by_country_year,
std_mean_ladder = standard_deviation_mean_of_ladder_by_country_year,
gini_index_world_bank_mean = gini_index_world_bank_estimate_average_2000_15,
gini_of_household_income = gini_of_household_income_reported_in_gallup_by_wp5_year) %>%
mutate(country = country %>% str_to_lower(),
country_recoded = case_when(
country == "russia" ~ "russian federation",
str_detect(country,"congo") ~ "congo",
country == "myanmar" ~ "burma (myanmar)",
country == "burkina faso" ~ "burkina",
str_detect(country,"china") ~ "china",
country == "kosovo" ~ "serbia",
str_detect(country,"cyprus") ~ "cyprus",
str_detect(country,"palest") ~ "israel",
str_detect(country,"somal") ~ "somalia",
country == "south korea" ~ "korea, south",
TRUE ~ country
)) %>%
left_join(countries_continents, by = c("country_recoded" = "country"))
vis_miss(df_happiness)
No more missing values.
Let’s do some EDA. Some maps are interactive (the legends can be used to filter!). So feel free to explore.
Happiness levels have remained largely constant over the years except for South America…
plt_year_happiness <-
df_happiness %>%
filter(year > 2005) %>%
ggplot(aes(year, life_ladder, fill = continent, label = country)) +
geom_jitter(alpha = .55, size = 4, width = .27, height = 0, pch=21) +
geom_smooth(method = "loess", se = F, aes(col = continent)) +
scale_x_continuous(breaks = 2005:2017) +
scale_y_continuous(breaks = 2:10) +
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1") +
theme_light()
ggplotly(plt_year_happiness, tooltip = c("country", "continent"))
And the plot clearly shows the ranking of happiness levels. Although, the Nordic countries like Finland, Denmark and Norway are consistently among the top, Europe as a whole, is right there with North America and South America. Asia is about a step lower and Africa 2.
If we filter just one continent at a time, we can see that South America is the continent with the least variation in happiness levels. Asia has a huge amount of variation and so do Europe and North America. The Oceania countries are just Australia and New Zealand. Not too many countries to talk about there.
Admittedly, the mean is not the best reflector of happiness of a country. We need the variance too! We have 2 columns in the dataset named standard_deviation_of_ladder_by_country_year, standard_deviation_mean_of_ladder_by_country_year (standard error?). Not sure what the second means and assuming the first one is what I need, let’s see the countries with the most and least variation.
plt_year_std <-
df_happiness %>%
filter(year > 2005) %>%
ggplot(aes(year, std_ladder, fill = continent, label = country, size = life_ladder)) +
geom_jitter(alpha = .55, width = .27, height = 0, pch=21) +
scale_x_continuous(breaks = 2005:2017) +
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1") +
theme_light()
ggplotly(plt_year_std, tooltip = c("country", "continent","life_ladder"))
Wow! Many countries have a standard deviation around 3 and a happiness level around 4! Imagine the happiness-inequality if you will in these countries.
On the other side, the happiest countries are the ones where all of them are general very happy and not where some are extremely happy and some are just moderately happy getting averaging each other out. Powerful idea.
Let’s confirm this hypothesis by looking at a plot of standard deviation versus mean for 2017.
plt_std_happiness <-
df_happiness %>%
filter(year == 2017) %>%
ggplot(aes(life_ladder, std_ladder, fill = continent, label = country, size = life_ladder)) +
geom_jitter(alpha = .65, width = .25, size = 3, height = 0, pch=21) +
geom_smooth(aes(group = 1), se = F) +
scale_x_continuous(breaks = 2005:2017) +
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1") +
theme_light()
ggplotly(plt_std_happiness, tooltip = c("country", "continent","life_ladder"))
Let’s take a look at the correlation plot of all the variables in our dataset and explore relationships in detail.
cors <- cor(df_happiness %>% select_if(is.numeric), method = "spearman")
corrplot(cors)
highly_correlated_with_happiness <- cors["life_ladder",][order(-abs(cors["life_ladder",]))] %>% .[-1] %>% .[1:5]
highly_correlated_with_happiness
## log_gdp_per_capita std_mean_ladder healthy_life_expectancy
## 0.7942000 -0.7739672 0.7590268
## social_support delivery_quality
## 0.7471541 0.6583947
We already analyzed the negative correlation standard deviation has with happiness.
GDP per capita is the most highly correlated. No surprises here.
plt_gdp_happiness <-
df_happiness %>% filter(year > 2005) %>% #we don't have neough values
ggplot(aes(log_gdp_per_capita, life_ladder, col = continent, label = country)) +
geom_point(alpha = .8, size = 2) +
geom_smooth(method = "loess", aes(group = 1), se = F) +
facet_wrap(~year) +
scale_color_brewer(palette = "Set1") +
theme_light()
ggplotly(plt_gdp_happiness, tooltip = "country")
We could take a look at healthy life expectancy vs happiness too. Note that it is that’s highly correlated with GDP. Also note that, it is not just life expectancy. It’s healthy life expectancy! It matters how alive you are! I am not splitting by year in the next few plots.
df_happiness %>%
ggplot(aes(healthy_life_expectancy, life_ladder, col = continent)) +
geom_point(alpha = .8, size = 2.5) +
geom_smooth(method = "lm", aes(group = 1), se = F) +
scale_color_brewer(palette = "Set1") +
theme_light()
So is social support. However, it is interesting to look at what social support is: >Social support is the national average of the binary responses (either 0 or 1) to the Gallup World Poll (GWP) question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”
A sense of security and connection are common among happy people.
df_happiness %>%
ggplot(aes(social_support, life_ladder, col = continent)) +
geom_point(alpha = .8, size = 2.5) +
geom_smooth(method = "loess", aes(group = 1), se = F) +
scale_color_brewer(palette = "Set1") +
theme_light()
Okay, now to some machine learning. Let’s see if we can predict happiness!!
We split the data into 2 parts - year 2017 (which we’ll predict on) and all the other years (which we’ll build on using cross validation).
I am trying to model happiness using only qualities that can easily measured. For instance, variables like generosity, social support, perception of corruption etc are subjective. Also they are also collected in the same surveys as happiness and a model using them wouldn’t be too useful.
So the variables I am using:
df_happiness_caret <-
df_happiness %>%
filter(year != 2017) %>%
dplyr::select(life_ladder, continent, log_gdp_per_capita, healthy_life_expectancy, gini_of_household_income) %>%
as.data.frame()
df_test <- df_happiness %>% filter(year == 2017) %>% as.data.frame()
I am also centering, scaling and using box cox to remove skew as part of preprocessing.
pre_process_model <- preProcess(df_happiness_caret[,-1], method = c("center", "scale","BoxCox"))
df_happiness_caret <- predict(pre_process_model, df_happiness_caret)
df_test <- predict(pre_process_model, df_test)
I am trying out many models - random forests, ranger(a different implementation for RF), Radial SVM, xgboost, knn, elasticNet and a liner regression. Like Stated above, we use 10 fold cross validation to build and validate our models and let caret take care of parameter tuning.
tr <- trainControl(method = "cv", number = 10)
model_rf <- train(life_ladder~.,
data = df_happiness_caret,
trControl = tr,
method = "rf",
importance = T)
model_ranger <- train(life_ladder~.,
data = df_happiness_caret,
trControl = tr,
method = "ranger")
model_lm <- train(life_ladder~.,
data = df_happiness_caret,
trControl = tr,
method = "lm")
model_xgb <- train(life_ladder~.,
data = df_happiness_caret,
trControl = tr,
method = "xgbTree")
model_enet <- train(life_ladder~.,
data = df_happiness_caret,
trControl = tr,
method = "enet")
model_svm <- train(life_ladder~.,
data = df_happiness_caret,
trControl = tr,
method = "svmRadial")
model_knn <- train(life_ladder~.,
data = df_happiness_caret,
trControl = tr,
method = "knn")
results <- resamples(list(
rf = model_rf,
ranger = model_ranger,
svm = model_svm,
xgb = model_xgb,
enet = model_enet,
lm = model_lm,
knn = model_knn
))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: rf, ranger, svm, xgb, enet, lm, knn
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf 0.3127666 0.3221300 0.3291425 0.3336669 0.3386175 0.3656728 0
## ranger 0.2929264 0.3051686 0.3245497 0.3247907 0.3434719 0.3544299 0
## svm 0.3465727 0.3676187 0.3892549 0.3903614 0.3940063 0.4718116 0
## xgb 0.3517441 0.3652519 0.3955501 0.3852190 0.4019682 0.4089824 0
## enet 0.4556655 0.4736102 0.4793619 0.4874707 0.5015821 0.5311790 0
## lm 0.4407623 0.4728067 0.4932167 0.4883003 0.5052972 0.5233223 0
## knn 0.3352887 0.3583825 0.3757967 0.3802513 0.4027484 0.4225999 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf 0.3913830 0.4181610 0.4290352 0.4349838 0.4494933 0.4851138 0
## ranger 0.3825692 0.3933982 0.4223554 0.4254409 0.4501151 0.4761334 0
## svm 0.4403932 0.4747652 0.4954691 0.4986484 0.5086341 0.6035310 0
## xgb 0.4549341 0.4607832 0.4872270 0.4876601 0.5111071 0.5296463 0
## enet 0.5730692 0.5981592 0.6098451 0.6135854 0.6243215 0.6768733 0
## lm 0.5755683 0.5853817 0.6208868 0.6141204 0.6368414 0.6499034 0
## knn 0.4485176 0.4630974 0.4885742 0.4907196 0.5216832 0.5281654 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf 0.8216111 0.8354730 0.8541433 0.8506751 0.8627548 0.8804787 0
## ranger 0.8335487 0.8422982 0.8556106 0.8568371 0.8745577 0.8792201 0
## svm 0.7232042 0.7904282 0.8035862 0.8026814 0.8302326 0.8367161 0
## xgb 0.7764371 0.7912191 0.8162395 0.8120653 0.8332812 0.8409831 0
## enet 0.6512339 0.6688344 0.7024522 0.7014993 0.7334520 0.7445536 0
## lm 0.6463184 0.6801453 0.7056950 0.7023067 0.7299158 0.7463589 0
## knn 0.7803696 0.7976135 0.8162021 0.8103628 0.8215135 0.8385605 0
Perfect! Let’s tidy up the results a bit and compare the models.
(results_tidy <-
results$values %>%
gather(metric, value, 2:ncol(.)) %>%
spread(Resample, value) %>%
separate(metric, into = c("model", "metric"), sep = "~") %>%
gather(fold, value, starts_with("Fold")) %>%
mutate(fold = fold %>% str_extract(one_or_more(DGT)) %>% as.integer()) %>%
group_by(model, metric) %>%
mutate(mean = mean(value),
median = median(value)))
results_tidy %>%
filter(metric == "RMSE") %>%
ggplot(aes(fold, value, col = model)) +
geom_point() +
geom_line() +
scale_color_brewer(palette = "Set1") +
geom_hline(aes(yintercept = mean, col = model), alpha = 0.5, linetype = 2) +
theme_light()
It’s a tight competition between ranger and random forest but ranger is the winner! It has the lowest average RMSE over the ten folds.
Noe let’s add predictions of our model to the test set and compare the actual with the predicted results.
(df_test <-
df_test %>%
as_tibble() %>%
mutate(pred_life_ladder = predict(model_ranger, df_test)) %>%
select(country, life_ladder, pred_life_ladder, everything()))
plt_predicted_actual <-
df_test %>%
ggplot(aes(life_ladder, pred_life_ladder, col = continent, label = country)) +
geom_abline(intercept = 0, slope = 1, alpha = 0.5) +
geom_point(size = 2) +
scale_x_continuous(limits = c(2,8)) +
scale_y_continuous(limits = c(2,8)) +
theme_light()
ggplotly(plt_predicted_actual)
Our model does a pretty neat job predicting happiness. Except in the lower tail where it’s a little off, the overall predictions are seem decent. Some countries actually have spot on predictions! On an average it is about 0.4687635 steps off.
Let’s take a look at our best and worst predictions.
(test_acc <-
df_test %>%
select(country, continent, life_ladder, pred_life_ladder) %>%
mutate(resid = life_ladder - pred_life_ladder,
resid_abs = abs(resid),
closeness = ifelse(resid_abs < mean(resid_abs), "close", "not_close")))
The countries which our model is off by a step.
df_test %>%
filter(!near(pred_life_ladder, life_ladder, 1))
The countries that are predicted extremely well.
df_test %>%
filter(near(pred_life_ladder, life_ladder, .05))
And what does a machine learning model have to say about which variables it found most useful to predict?
plot(varImp(model_rf))
The continents are obviously very useful. But the most important predictor is GDP!
Let me know your comments, questions, feedback in the comments below.. Thanks for reading!